## [1] "Summary of fixed.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Summary of volatile.acidity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## [1] "Summary of citric.acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Summary of residual.sugar"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## [1] "Summary of chlorides"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] "Summary of free.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## [1] "Summary of total.sulfur.dioxide"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## [1] "Summary of density"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## [1] "Summary of pH"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## [1] "Summary of sulphates"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## [1] "Summary of alcohol"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## [1] "Summary of Quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Below are some of the observations on the above plots:
- density and pH seem to be approximately normally distributed.
- residual.sugar, chlorides and sulphates seem to have a long tail on the positive side.
- fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol seem to have a right-skewed, roughly Poisson-like shape.
This dataset has 1599 observations and 13 variables. These 1599 observations correspond to 1599 types of red wines.
Let’s begin by finding the correlation between each independent variable and the dependent variable, quality.
## X fixed.acidity volatile.acidity
## 0.066 0.124 0.391
## citric.acid residual.sugar chlorides
## 0.226 0.014 0.129
## free.sulfur.dioxide total.sulfur.dioxide density
## 0.051 0.185 0.175
## pH sulphates quality
## 0.058 0.251 1.000
The results suggest that none of the independent variables has a strong correlation with quality. So we would need to work with multiple independent variables to see whether, together, they relate more strongly to quality.
Not yet. I may update this section if I create more variables.
We are going to turn the quality variable into a factor, as this lets us treat the task as a classification problem.
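As a rough sketch, the correlations with quality could be computed along the following lines. The helper name `abs_cor_with` and the `wine_toy` data frame are assumptions made for illustration; the toy values are made up and are not taken from the wine dataset.

```r
# Hypothetical helper: absolute Pearson correlation of every other
# column with a target column, rounded to 3 decimals.
abs_cor_with <- function(df, target) {
  predictors <- setdiff(names(df), target)
  cors <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  round(abs(cors), 3)
}

# Toy data for illustration only (NOT the wine dataset).
wine_toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.45, 0.36, 0.30),
  quality          = c(5, 5, 5, 6, 7, 7)
)

toy_cors <- abs_cor_with(wine_toy, "quality")
print(toy_cors)
```

Taking the absolute value keeps only the strength of each relationship, which is convenient for ranking variables but hides whether a variable moves with or against quality.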
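A minimal sketch of the conversion, using quality scores rebuilt from the frequency table printed earlier. Whether the original analysis used an ordered or an unordered factor is not stated; the ordered version below is an assumption.

```r
# Quality scores reconstructed from the frequency table above
# (scores 3 to 8 with their reported counts).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Ordered factor: keeps the ranking 3 < 4 < ... < 8 while treating
# each score as a class label.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(quality_f))
print(table(quality_f))
```

An ordered factor still supports comparisons such as `quality_f >= "7"`, which can be handy when splitting wines into higher and lower quality groups.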
## X fixed.acidity volatile.acidity citric.acid
## X 1.000 0.268 0.009 0.154
## fixed.acidity 0.268 1.000 0.256 0.672
## volatile.acidity 0.009 0.256 1.000 0.552
## citric.acid 0.154 0.672 0.552 1.000
## residual.sugar 0.031 0.115 0.002 0.144
## chlorides 0.120 0.094 0.061 0.204
## free.sulfur.dioxide 0.090 0.154 0.011 0.061
## total.sulfur.dioxide 0.118 0.113 0.076 0.036
## density 0.368 0.668 0.022 0.365
## pH 0.136 0.683 0.235 0.542
## sulphates 0.125 0.183 0.261 0.313
## residual.sugar chlorides free.sulfur.dioxide
## X 0.031 0.120 0.090
## fixed.acidity 0.115 0.094 0.154
## volatile.acidity 0.002 0.061 0.011
## citric.acid 0.144 0.204 0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 0.022
## pH 0.086 0.265 0.070
## sulphates 0.006 0.371 0.052
## total.sulfur.dioxide density pH sulphates
## X 0.118 0.368 0.136 0.125
## fixed.acidity 0.113 0.668 0.683 0.183
## volatile.acidity 0.076 0.022 0.235 0.261
## citric.acid 0.036 0.365 0.542 0.313
## residual.sugar 0.203 0.355 0.086 0.006
## chlorides 0.047 0.201 0.265 0.371
## free.sulfur.dioxide 0.668 0.022 0.070 0.052
## total.sulfur.dioxide 1.000 0.071 0.066 0.043
## density 0.071 1.000 0.342 0.149
## pH 0.066 0.342 1.000 0.197
## sulphates 0.043 0.149 0.197 1.000
volatile.acidity, density, pH, citric.acid, sulphates and alcohol values change as the quality changes.
The strongest relationship is between pH and fixed.acidity (0.683).
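One way to locate the strongest pair programmatically is sketched below. The matrix holds only four of the variables, with values copied from the correlation table above; `strongest_pair` is a hypothetical helper name.

```r
# Subset of the printed correlation matrix (values as shown above).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
cor_sub <- matrix(
  c(1.000, 0.672, 0.668, 0.683,
    0.672, 1.000, 0.365, 0.542,
    0.668, 0.365, 1.000, 0.342,
    0.683, 0.542, 0.342, 1.000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Hypothetical helper: names of the pair with the largest
# off-diagonal value in a symmetric matrix.
strongest_pair <- function(m) {
  m[!upper.tri(m)] <- NA  # keep each pair once and drop the diagonal
  idx <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
  c(rownames(m)[idx[[1]]], colnames(m)[idx[[2]]])
}

best <- strongest_pair(cor_sub)
print(best)
```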
Here, we are trying to find a combination of variables that distinguishes the high-quality wines from the low-quality ones. We talk about the analysis below.
Taken together, the following pairs of variables seem to have interesting relationships for distinguishing between higher-quality and lower-quality wines:

- alcohol & chlorides
- alcohol & volatile.acidity
- alcohol & sulphates
- sulphates & volatile.acidity

No, I didn’t see any worth mentioning.
A very high share of the wines (more than four fifths) have a quality of 5 or 6.
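The share can be checked directly from the quality counts printed earlier; a short sketch:

```r
# Counts per quality score, copied from the frequency table above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines rated 5 or 6.
share_5_6 <- sum(counts[c("5", "6")]) / sum(counts)
print(round(share_5_6, 3))  # 0.825
```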
As the wine quality increases, the median values of sulphates, alcohol and citric.acid increase, and the median values of volatile.acidity, density and pH decrease.
Combinations of two variables give us a little insight into the difference between wines of higher and lower quality.
I started by printing some basic summaries of the dataset and plotting the univariate plots. The dataset looked good for analysis at that point. However, once I reached the bivariate analysis and tried to find the impact of each independent variable on quality, no clear winner emerged, which was quite discouraging in the beginning.
In conclusion, we set out to find variables that, individually or in combination, influence the quality of the wine. We conclude that no single variable can predict the quality of the wine on its own. We would need to use multiple variables and do more analysis, probably with more data.
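A sketch of how per-quality medians could be tabulated with base R's `aggregate`. The `wine_toy` data frame is made-up toy data for illustration only and is not taken from the wine dataset; it is constructed so that alcohol rises and volatile.acidity falls with quality, mirroring the pattern described above.

```r
# Toy data for illustration only (NOT the wine dataset).
wine_toy <- data.frame(
  quality          = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol          = c(9.4, 9.8, 9.5, 10.5, 10.0, 11.0, 11.5, 12.0, 11.8),
  volatile.acidity = c(0.70, 0.66, 0.58, 0.50, 0.56, 0.49, 0.40, 0.36, 0.38)
)

# Median of each variable within each quality level.
medians <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = wine_toy,
  FUN = median
)
print(medians)
```

Medians are less sensitive than means to the long right tails seen in several of these variables, which is one reason to compare groups this way.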